feat: expose ParameterServer metas via HTTP for cross-process join #90

Open
ddmm2020 wants to merge 4 commits into MoonshotAI:main from ddmm2020:feat/expose-metas-endpoints

@ddmm2020 ddmm2020 commented Jun 15, 2026

Expose ParameterServer metas over HTTP for cross-process P2P join

Let a newly started inference replica pull weights from an existing replica over RDMA instead of reloading them from disk. Replicas started by checkpoint-engine already hold pinned CPU weight buffers registered with the mooncake P2PStore; these endpoints expose the corresponding metas so a new replica can RDMA-pull the weights directly.

Changes

  • api.py: add GET /v1/metas (returns ps.get_metas() as JSON) and POST /v1/metas (feeds JSON into ps.load_metas()). Serialized via pydantic TypeAdapter, reusing the existing schema.
  • examples/update.py: join() now reads metas from either a local file (--load-metas-file) or an HTTP URL (--metas-url), then pulls the weights into the local vLLM over RDMA.
  • tests/test_api.py: tests covering JSON round-trip, PS-error propagation, bad-input rejection, schema mismatch, and GET→POST consistency.
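
The file-or-URL choice in join() could look roughly like the sketch below. It is stdlib-only and hypothetical: `read_metas_json` and its parameter names are illustrative, not the names used in examples/update.py, and requiring exactly one source is an assumption rather than documented behavior.

```python
import urllib.request
from pathlib import Path
from typing import Optional


def read_metas_json(
    load_metas_file: Optional[str] = None,
    metas_url: Optional[str] = None,
    timeout: float = 10.0,
) -> bytes:
    """Return raw metas JSON from exactly one source (hypothetical helper)."""
    # Assumption: the two sources are mutually exclusive.
    if (load_metas_file is None) == (metas_url is None):
        raise ValueError("pass exactly one of load_metas_file or metas_url")
    if load_metas_file is not None:
        # Local path, mirroring the --load-metas-file flag.
        return Path(load_metas_file).read_bytes()
    # Remote path, mirroring the --metas-url flag (e.g. a GET on /v1/metas).
    with urllib.request.urlopen(metas_url, timeout=timeout) as resp:
        return resp.read()
```

The returned bytes would then be validated against the metas schema before being handed to ps.load_metas(); the host and port in a real `metas_url` depend on how the existing replica's API server was started.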

Verified: end-to-end on a 2-node job — 14.5 GiB Qwen2.5-7B transferred from main to elastic in 1.5s over RDMA (4× mlx5_bond HCAs), vs ~6.5s over TCP fallback.
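
For scale, the figures above translate into effective throughput as follows. This is plain arithmetic on the reported numbers (14.5 GiB, 1.5 s, 6.5 s), not an additional measurement.

```python
def throughput_gib_per_s(size_gib: float, seconds: float) -> float:
    """Effective transfer rate in GiB/s."""
    return size_gib / seconds


SIZE_GIB = 14.5   # reported model size
RDMA_S = 1.5      # reported RDMA transfer time
TCP_S = 6.5       # reported TCP fallback transfer time (approximate)

rdma_rate = throughput_gib_per_s(SIZE_GIB, RDMA_S)
tcp_rate = throughput_gib_per_s(SIZE_GIB, TCP_S)
speedup = TCP_S / RDMA_S

print(f"RDMA: {rdma_rate:.2f} GiB/s, TCP: {tcp_rate:.2f} GiB/s, speedup: {speedup:.2f}x")
```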

Add two HTTP endpoints (GET/POST /v1/checkpoints/{name}/{metas,load-metas})
and a standalone `python -m checkpoint_engine.join_cli` entrypoint, so a
new ParameterServer instance can join an existing P2P weight world over
mooncake RDMA without re-reading checkpoints from disk.

Motivation: in elastic-rollout setups (e.g. mshrl), a long-running training
job already holds pinned CPU weight buffers registered with the mooncake
P2PStore. Newly-started inference replicas should be able to pull these
weights over RDMA instead of re-converting the checkpoint from disk.

Changes:
* api.py: GET /v1/checkpoints/{name}/metas returns pickle.dumps(ps.get_metas())
  as application/octet-stream; POST /v1/checkpoints/{name}/load-metas accepts
  the same bytes and feeds them into ps.load_metas(). Bad pickle is rejected
  with 400; PS errors are surfaced as 500.
* join_cli.py: `python -m checkpoint_engine.join_cli` -- the join() flow from
  examples/update.py, packaged as a first-class CLI under the published
  package so consumers can invoke it without checking out the source tree.
  Accepts metas from either a local pickle file or a remote HTTP URL.
* tests/test_api.py: 6 CPU-only tests covering pickle round-trip, ps-error
  propagation, bad-input rejection, and a GET-then-POST chain that validates
  the new endpoints are mutually consistent.

Verified end-to-end on a 2-node launchpad job: 14.5 GiB Qwen2.5-7B weights
transferred from main to elastic in 1.49s over real RDMA (4 mlx5_bond HCAs)
vs 6.49s over TCP fallback in environments without RDMA passthrough.
@HubertZhang HubertZhang self-requested a review June 24, 2026 13:42
Comment thread checkpoint_engine/api.py Outdated
Replace pickle with pydantic TypeAdapter(dict[int, MemoryBufferMetaList])
for the metas wire format across the HTTP endpoints, join_cli, and
examples/update.py. This reuses the existing pydantic schema (torch.dtype
/ torch.Size already have serializers in data_types.py), removes the
arbitrary-code-execution risk of pickle.loads on request bodies, and makes
the metas self-describing for cross-language consumers.

- api.py: GET /metas returns application/json; POST /load-metas validates
  via validate_json and returns 400 on ValidationError (was broad except).
- join_cli.py / examples/update.py: read/write metas as JSON; document the
  --metas-url HTTP path alongside --load-metas-file.
- tests/test_api.py: use real MemoryBufferMetaList fixtures; add a
  schema-mismatch case (valid JSON, wrong shape -> 400).
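
To illustrate why a validated JSON body is safer than pickle.loads on request bytes, here is a stdlib-only stand-in for the validation step. It is not the pydantic TypeAdapter code from this PR: `parse_metas`, `BadMetas`, and `status_for` are hypothetical names, and the per-entry check is a placeholder that only requires a JSON object rather than the real MemoryBufferMetaList fields.

```python
import json
from typing import Any


class BadMetas(ValueError):
    """The request body is not a well-formed metas document."""


def parse_metas(body: bytes) -> dict[int, dict[str, Any]]:
    """Parse a JSON object keyed by integer; build plain data only."""
    try:
        doc = json.loads(body)  # unlike pickle.loads, cannot execute code
    except ValueError as exc:   # covers JSONDecodeError and bad encodings
        raise BadMetas(f"invalid JSON: {exc}") from exc
    if not isinstance(doc, dict):
        raise BadMetas("top level must be a JSON object")
    metas: dict[int, dict[str, Any]] = {}
    for key, value in doc.items():
        # JSON object keys are always strings, so integer keys must be
        # converted back explicitly.
        try:
            int_key = int(key)
        except ValueError as exc:
            raise BadMetas(f"non-integer key: {key!r}") from exc
        if not isinstance(value, dict):  # placeholder shape check
            raise BadMetas(f"entry {key!r} must be a JSON object")
        metas[int_key] = value
    return metas


def status_for(body: bytes) -> int:
    """Map validation failures to 400, mirroring the endpoint's contract."""
    try:
        parse_metas(body)
    except BadMetas:
        return 400
    return 200
```

Both malformed JSON and valid JSON of the wrong shape end up as 400 here, which is the distinction the schema-mismatch test case exercises.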
@ddmm2020 ddmm2020 force-pushed the feat/expose-metas-endpoints branch from d083f34 to f3392da Compare June 25, 2026 04:58
Comment thread checkpoint_engine/api.py Outdated
media_type="application/json",
)

@app.post("/v1/checkpoints/{checkpoint_name}/load-metas")

Collaborator

We can use the same URL, "/v1/checkpoints/{checkpoint_name}/metas", for both methods to make this HTTP API more RESTful.

Collaborator

I think these metas interfaces should not include the checkpoint_name param. Try simplifying the path to /v1/metas.

Comment thread checkpoint_engine/api.py Outdated
)

@app.post("/v1/checkpoints/{checkpoint_name}/load-metas")
async def load_metas(checkpoint_name: str, raw: Request) -> Response:

Collaborator

I found a more elegant approach. You can declare the request body directly in the function signature, like

    async def load_metas(checkpoint_name: str, metas: dict[int, MemoryBufferMetaList]) -> Response:

Then we can call wrap_exception(lambda: ps.load_metas(metas)) directly.

Comment thread checkpoint_engine/join_cli.py Outdated
@@ -0,0 +1,173 @@
"""checkpoint_engine.join_cli

Collaborator

There seems to be too much duplicated code compared to examples/update.py. Is it necessary to add this join_cli.py file?

刘伟健 added 2 commits June 26, 2026 11:30
join_cli was a duplicate of examples/update.py:join() with an extra
--metas-url flag. Add the flag to update.py, remove join_cli.

The checkpoint_name path param was never read; ps.get_metas() and
ps.load_metas() act on a single global field. Drop it from the URL.
3 participants